JAMIA Open — Latest Matching Preprints

1

A bibliometric review of explainable AI in diabetes risk prediction: Trends, gaps, and knowledge graph opportunities

Van, T. A.

2026-04-20 health informatics 10.64898/2026.04.16.26351069 medRxiv

Top 0.1%

13.8%

Background: Type 2 diabetes mellitus (T2DM) is a leading global public health challenge. Machine learning (ML) combined with Explainable AI (XAI) is increasingly applied to T2DM risk prediction, but the field lacks a quantitative overview of methodological trends and integration gaps. Methods: We present a structured synthesis and critical analysis of the XAI literature on T2DM risk prediction, combining (i) quantitative bibliometric analysis of a two-database corpus (N = 2,048 documents from Scopus and PubMed/MEDLINE, deduplicated via a transparent three-tier pipeline) and (ii) an in-depth selective review of 15 highly cited papers. Reporting follows PRISMA 2020, adapted for metadata-based synthesis; analyses include keyword frequency, rule-based thematic clustering, and publication trend analysis. Results: The field grew rapidly, from 36 documents (2020) to 866 (2025). SHAP and LIME dominate XAI methods; XGBoost and Random Forest dominate ML models. Critically, terms related to knowledge graphs (KG) or graph neural networks (GNN) appeared in only 17 documents (~0.83%) compared with 906 for XAI methods, a 53.3:1 disparity. This gap is consistent across both databases, which share 33.2% of their records, ruling out a single-database artifact. The selective review confirmed that none of the 15 highly cited papers combined all three components (ML, XAI, and KG) in T2DM risk prediction. Conclusions: The XAI for T2DM risk prediction field exhibits a clinical interpretability gap: statistical explanations are rarely linked to structured clinical pathways. We propose a three-layer conceptual framework (Predictive → Explainability → Knowledge) that integrates KG as a supplementary semantic layer, with potential applications in clinical decision support and population-level screening. The framework does not perform true causal inference but structures explanations around established pathophysiological knowledge. This study contributes a transferable methodology and a quantified research gap to guide future work integrating ML, XAI, and structured medical knowledge.

2

Decision Curve Analysis for Evaluating Machine Learning Models for Next-Day Transfer Out of ICU

Pozo, M.; Pape, A.; Locke, B.; Pettine, W. W.

2026-04-21 health informatics 10.64898/2026.04.19.26351213 medRxiv

Top 0.1%

8.8%

Timely identification of intensive care unit (ICU) patients likely to exit the unit can support anticipatory workflows such as chart review, eligibility screening, and patient outreach prior to transfer. Most ICU discharge prediction studies report discrimination and calibration, but these metrics do not quantify the decision consequences of acting on predictions. Using adult ICU admissions from MIMIC-IV, we represented each ICU stay as a sequence of daily clinical summaries and trained logistic regression, random forest, and XGBoost models to predict next-day ICU transfer. Models achieved ROC AUC of 0.80-0.84 with differing calibration. We evaluated decision utility using decision curve analysis (DCA), where positive predictions trigger proactive review. Across thresholds, model-guided strategies outperformed review-all, review-none, and a simple clinical rule. To translate net benefit into implementable operations, we modeled a clinical trial recruitment workflow with an 8-hour daily time constraint, incorporating chart review and consent effort. At a feasible operating threshold (0.23), the model flagged ~23 charts/day and yielded ~1.23 enrollments/day under conservative eligibility and consent assumptions. These results demonstrate that DCA provides a transparent framework for determining when ICU transfer predictions are worth using and how thresholds should be selected to align with real-world workflow constraints. Data and Code Availability: This research has been conducted using data from MIMIC-IV. Researchers can request access via PhysioNet. Implementation code is available upon request.
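
Illustrative sketch (not the authors' code): the net-benefit quantity behind decision curve analysis can be written in a few lines. This assumes the standard DCA definition, net benefit = TP/n - (FP/n) * pt/(1 - pt), with "review-all" modeled as flagging every stay. The function names and the toy labels and probabilities are hypothetical; only the threshold value 0.23 is taken from the abstract.

```python
from typing import Sequence


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """Standard DCA net benefit when every case with probability >= threshold is flagged."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(y_true)
    if n == 0 or n != len(y_prob):
        raise ValueError("y_true and y_prob must be non-empty and the same length")
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # The threshold odds express how many false positives one true positive is worth.
    odds = threshold / (1.0 - threshold)
    return tp / n - (fp / n) * odds


def net_benefit_review_all(y_true: Sequence[int], threshold: float) -> float:
    """Net benefit of the 'review-all' strategy: every case is flagged."""
    return net_benefit(y_true, [1.0] * len(y_true), threshold)


# Hypothetical toy data: 1 = transferred out of the ICU the next day.
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
y_prob = [0.9, 0.1, 0.7, 0.3, 0.2, 0.6, 0.05, 0.4, 0.15, 0.35]

nb_model = net_benefit(y_true, y_prob, threshold=0.23)
nb_all = net_benefit_review_all(y_true, threshold=0.23)
nb_none = 0.0  # 'review-none' flags nobody, so its net benefit is zero by definition
print(f"model: {nb_model:.3f}  review-all: {nb_all:.3f}  review-none: {nb_none:.3f}")
```

A decision curve is this calculation repeated over a range of thresholds and plotted for each strategy; a model is worth using at thresholds where its curve lies above both reference strategies.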

3

MIMIC-IV-Phenotype-Atlas (MIPA) : A Publicly Available Dataset for EHR Phenotyping

Yamga, E.; Goudrar, R.; Despres, P.

2026-04-24 health informatics 10.64898/2026.04.16.26350888 medRxiv

Top 0.1%

8.4%

Introduction Secondary use of electronic health records (EHRs) often requires transforming raw clinical information into research-grade data. A central step in this process is EHR phenotyping, the identification of patient cohorts defined by specific medical conditions. Although numerous approaches exist, from ICD-based heuristics to supervised learning and large language models (LLMs), the field lacks standardized benchmark datasets, limiting reproducibility and hindering fair comparison across methods. Methods We developed the MIMIC-IV Phenotype Atlas (MIPA) dataset, an adaptation of MIMIC-IV that provides expert-annotated discharge summaries across 16 phenotypes of varying prevalence and complexity. Two independent clinicians reviewed and labeled the discharge summaries, resolving disagreements by consensus. In parallel, we implemented a processing pipeline that extracts multimodal EHR features and generates training, validation, and testing datasets for supervised phenotyping. To illustrate MIPA's utility, we benchmarked four phenotyping methods on the task: ICD-based classifiers, keyword-driven Term Frequency-Inverse Document Frequency (TF-IDF) classifiers, supervised machine learning (ML) models, and LLMs. Results The final MIPA corpus consists of 1,388 expert-annotated discharge summaries. Annotation reliability was high (mean document-level kappa = 0.805, mean label-level kappa = 0.771), with 91% of disagreements resolved through consensus review. MIPA provides high-quality phenotype labels paired with structured EHR features and predefined train/validation/test splits for each phenotype. In the benchmarking case study, LLMs achieved the highest F1 scores in 13 of 16 phenotypes, particularly for conditions requiring contextual interpretation of clinical narrative, while supervised ML offered moderate improvements over rule-based baselines. Conclusion MIPA is the first publicly available benchmark dataset dedicated to EHR phenotyping, combining expert-curated annotations, broad phenotype coverage, and a reproducible processing pipeline. By enabling standardized comparison across ICD-based heuristics, ML models, and LLMs, MIPA provides a durable reference resource to advance methodological development in automated phenotyping.
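
Illustrative sketch (not the authors' code): the abstract reports kappa for two annotators but does not say which variant was used or how document-level and label-level scores were averaged. Assuming Cohen's kappa on a single binary phenotype, agreement between two annotators could be computed as below. The function name and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label independently.
    expected = sum(counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations for one phenotype on ten discharge summaries (1 = present).
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy case the annotators agree on 8 of 10 summaries, but because chance agreement is 0.52, kappa is noticeably lower than the raw agreement rate.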

4

Research Paper on AuditMed: A Single-File, Browser-Based Clinical Evidence Audit Platform Architecture, Current Capabilities, and Proposed Applications in Drug Informatics and Pharmacy Education

Ferguson, D. J.

2026-04-20 health informatics 10.64898/2026.04.19.26351188 medRxiv

Top 0.2%

6.5%

Background: Clinical pharmacists, trainees, and educators rely on multi-database literature retrieval and structured evidence synthesis to answer drug-information questions. Existing workflows require navigation across PubMed, DailyMed, LactMed, interaction checkers, and specialty guideline repositories with manual de-duplication, appraisal, and synthesis. Commercial platforms that integrate these functions are costly and often unavailable in community, rural, and international training contexts. Objective: This report describes the architecture of AuditMed, a single-file, browser-based clinical evidence audit platform, and reports preliminary stress-test results against a complex multi-morbidity case corpus. AuditMed is intended for research and educational use and is not a substitute for clinical judgment or validated commercial clinical decision-support systems. Methods: AuditMed integrates nineteen free, publicly available clinical and biomedical application programming interfaces into a six-stage Search → Select → Parse → Analyze → Infer → Create pipeline and supports browser-local patient-case ingestion with regex-based HIPAA Safe Harbor de-identification. Preliminary stress-testing was conducted against eleven cases (Cases 30 through 40) from the Complex Clinical Case Compendium Software Validation Suite, each featuring over twenty concurrent active disease states. For each case, the one-click inference pipeline was executed with default settings and the full Clinical Inference Report was captured verbatim. No retrieval-sensitivity, synthesis-fidelity, or time-to-answer endpoints were pre-specified; the exercise was qualitative and oriented toward pipeline behavior under extreme multi-morbidity. Results: The pipeline completed without fatal errors for all eleven cases and produced a structured Clinical Inference Report in each instance. Quantitative-finding detection performed as designed for hematologic parameters and cardiac biomarkers. Two parser defects were identified and are reproduced in the appendix: an age-as-fever regex-precedence defect affecting seven cases and a diagnosis-versus-medication parsing defect affecting one case. Evidence-linkage rate varied from zero evidence-linked statements in seven cases to eleven in one case, reflecting dependence of the inference layer on MeSH-indexed literature coverage of the specific case diagnoses. Conclusions: AuditMed is an early-stage, open-source platform whose value at this stage is in providing a free, transparent, auditable workflow for multi-source evidence synthesis with explicit uncertainty flagging. The preliminary results document both robust end-to-end completion under extreme case complexity and specific, reproducible parser defects that will be addressed before formal evaluation. Planned evaluation studies are described.

5

Leveraging Predictive AI and LLM-Powered Trial Matching to Improve Clinical Trial Recruitment: A Usability Assessment of Trialshub

Blankson, P.-K.; Hussien, S.; Idris, F.; Trevillion, G.; Aslam, A.; Afani, A.; Dunlap, P.; Chepkorir, J.; Melgarejo, P.; Idris, M.

2026-04-20 health informatics 10.64898/2026.04.17.26351107 medRxiv

Top 0.2%

6.2%

Background: Recruitment remains a major barrier to timely clinical trial completion. Trialshub is an LLM-powered, chat-based platform intended to help users identify relevant trials and connect with coordinators to streamline recruitment workflows. Objective: To evaluate the perceived usability and operational value of Trialshub, and identify implementation considerations for real-world deployment. Methods: A usability test was conducted at Morehouse School of Medicine for the Trialshub application. Purposively selected participants included clinical research coordinators and individuals with and without clinical trial search experience. Participants completed a pre-test survey assessing demographics, digital health information behaviors, and familiarity with AI tools, followed by a moderated usability session using a Trialshub prototype. Users completed scenario-based tasks (locating a breast cancer trial, reviewing results, and initiating coordinator contact) using a think-aloud protocol. Task ratings, screen recordings, and transcribed feedback were analyzed descriptively and thematically. Results: Participants reported high comfort with using digital tools and moderate-to-high familiarity with AI. Trialshub's chat-first design, guided prompts, and checklist-style eligibility display were perceived as intuitive and reduced cognitive load. Fast access to trials and the coordinator-contact workflow were viewed positively. Key usability issues included uncertainty at step transitions, insufficient cues for selecting results and next actions, and inconsistent system reliability (loading delays, errors, and broken trial detail pages). Participants also noted redundant questioning due to limited conversational memory, and requested improved filtering/sorting and clearer calls-to-action. All participants indicated that Trialshub has strong potential to meaningfully improve clinical trial processes. Conclusions: Trialshub shows promise for improving trial discovery and recruitment workflows, with identified design implications for real-world deployment.

6

Large language models and retrieval augmented generation for complex clinical codelists: evaluating performance and assessing failure modes

Matthewman, J.; Denaxas, S.; Langan, S.; Painter, J. L.; Bate, A.

2026-04-24 health informatics 10.64898/2026.04.23.26351098 medRxiv

Top 0.2%

6.2%

Objectives: Large language models (LLMs) have shown promise in creating clinical codelists for research purposes, a time-consuming task requiring expert domain knowledge. Here, we evaluate the performance and assess failure modes of a retrieval augmented generation (RAG) approach to creating clinical codelists for the large and complex medical terminology used by the Clinical Practice Research Datalink (CPRD). Materials & Methods: We set up a RAG system using a database of word embeddings of the medical terminology that we created using a general-purpose word embedding model (gemini-embedding). We developed 7 reference codelists presenting different challenges and tagged required and optional codes. We ran 168 evaluations (7 codelists, 2 different database subsets, 4 models, 3 epochs each). Scoring was based on the omission of required codes and the inclusion of irrelevant codes. We used model-grading (i.e., grading by another LLM with the reference codelists provided as context) to evaluate the output codelists (a score of 0% being all incorrect and 100% being all correct). Results: We saw varying accuracy across models and codelists, with Gemini 3 Pro (score 43%) generally performing better than Claude Sonnet 4.6 (36%) and Gemini 3 Flash, and OpenAI GPT 5.2 performing worst (14%). Models performed better with shorter target codelists (e.g., Eosinophilic esophagitis with four codes, and Hidradenitis suppurativa with 14 codes). In contrast, all models consistently failed to produce a complete Wrist fracture codelist (with 214 required codes). We further present evaluation summaries, and failure mode evaluations produced by parsing LLM chat logs. Discussion: Besides demonstrating that a single-shot RAG approach is currently not suitable for codelist generation, we demonstrate failure modes including hallucinations, retrieval failures and generation failures where retrieved codes are not used. Conclusions: Our findings suggest that while RAG systems using current frontier LLMs may create correct clinical codelists in some cases, they still struggle with large and complex terminologies and codelists with a large number of codes. The failure modes we highlight can inform the creation of future workflows to avoid failures.

7

Harmonising UK primary care prescription records for research: A case study in the UK Biobank

Ytsma, C. R.; Torralbo, A.; Fitzpatrick, N. K.; Pietzner, M.; Louloudis, I.; Nguyen, D.; Ansarey, S.; Denaxas, S.

2026-04-22 health informatics 10.64898/2026.04.21.26351274 medRxiv

Top 0.3%

4.8%

Objective The aim of this study was to develop and validate an automated, scalable framework to harmonise fragmented UK primary care prescription records into a research-ready dataset by mapping four diverse medical ontologies to a unified, historically comprehensive reference standard. Materials and Methods We used raw prescription records for consented participants in the UK Biobank, in which participants are uniquely characterized by multiple data modalities. Primary care data were preprocessed by selecting one drug code if multiple were recorded, cleaning codes to match reference presentations, expanding code granularity based on drug descriptions, and updating outdated codes to a single reference version. Harmonisation entailed mapping British National Formulary (BNF) and Read2 codes to dm+d, the universal NHS standard vocabulary for uniquely identifying and prescribing medicines. Harmonised dm+d records were then homogenised to a single concept granularity, the Virtual Medicinal Product (VMP). We validated our methods by creating medication profiles mapping contemporary drug prescribing patterns in 312 physical and mental health conditions. Results We preprocessed 57,659,844 records (100%) from 221,868 participants (100%). Of those, 48,950 records were dropped due to lack of a drug code. 7,357,572 records (13%) used multiple ontologies. Most (76%) records were encoded in BNF and most had the code granularity expanded via the drug description (N=28,034,282; 49%). 41,244,315 records (72%) were harmonised to dm+d and 99.98% of these were converted to VMP as a homogeneous dataset. Across 312 diseases, we identified 23,352 disease-drug associations with 237 medications (represented as BNF subparagraphs) that survived statistical correction, of which most resembled drug-indication pairs. Conclusion Our methodology converts highly fragmented and raw prescription records with inconsistent data quality into a streamlined, enriched dataset at a single reference, version, and granularity of information. Harmonised prescription records can be easily utilised by researchers to perform large-scale analyses.

8

A Systematic Exploration of LLM Behavior for EHR phenotyping

Yamga, E.; Murphy, S.; Despres, P.

2026-04-24 health informatics 10.64898/2026.04.16.26350890 medRxiv

Top 0.3%

4.3%

Background Electronic health record (EHR) phenotyping underpins observational research, cohort discovery, and clinical trial screening. Large language models (LLMs) offer new capabilities for extracting phenotypes from unstructured text, but their performance depends on pipeline design choices, including prompting, text segmentation, and aggregation. No systematic framework has previously examined how these parameters shape accuracy and reproducibility. Methods We evaluated LLM-based phenotyping pipelines using 1,388 discharge summaries across 16 clinical phenotypes. A full factorial experiment with LLaMA-3B, 8B, and 70B systematically varied three pipeline components: prompting (zero-shot, few-shot, chain-of-thought, extract-then-phenotype), chunking (none, naive, document-based), and aggregation (any-positive, two-vote, majority), yielding 24 configurations per model. To compare intrinsic model capabilities, biomedical domain-adapted, commercial frontier (LLaMA-405B, GPT-4o, Gemini Flash 2.0), and reasoning-optimized models (DeepSeek-R1) were evaluated under a fixed configuration. Performance was assessed using precision, recall, and macro-F1; secondary analyses examined prediction consistency (Shannon entropy), self-confidence calibration, and the development of a taxonomy of recurrent model errors. Results Factorial ANOVAs showed that chunking and aggregation were the dominant drivers of performance, whereas the prompting strategy contributed minimally. Configuration effects were stable across model sizes, with no significant Model × Parameter interactions. Phenotype difficulty varied substantially (macro-F1 = 0.40-0.90), yet the highest-performing configuration (whole-document inference without aggregation) was consistent across phenotypes, as confirmed by mixed-effects modeling. In cross-model comparisons, DeepSeek-R1 achieved the highest macro-F1 (0.89), while LLaMA-70B matched GPT-4o and LLaMA-405B at substantially lower cost. Prediction entropy was low overall and driven primarily by phenotype difficulty rather than prompting or temperature. Self-confidence calibration was only moderately informative: high-confidence predictions were more accurate, but larger models exhibited systematic overconfidence. Conclusions LLM performance in EHR phenotyping is governed primarily by input structure and model capacity, not prompt engineering. Simple, document-level inference yields robust performance across diverse phenotypes, providing practical design guidance for LLM-based cohort identification while underscoring the continued need for human oversight for challenging phenotypes.
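
Illustrative sketch (not the authors' code): the three aggregation rules named in the abstract combine chunk-level predictions into one document-level label. The abstract does not define them, so the readings below are assumptions: "any-positive" needs at least one positive chunk, "two-vote" needs at least two, and "majority" needs more than half. All function names are hypothetical.

```python
from typing import Callable, Dict, Sequence


def any_positive(votes: Sequence[int]) -> bool:
    """Document is positive if at least one chunk is predicted positive."""
    return sum(votes) >= 1


def two_vote(votes: Sequence[int]) -> bool:
    """Document is positive if at least two chunks are predicted positive."""
    return sum(votes) >= 2


def majority(votes: Sequence[int]) -> bool:
    """Document is positive if strictly more than half of the chunks are positive."""
    return 2 * sum(votes) > len(votes)


AGGREGATORS: Dict[str, Callable[[Sequence[int]], bool]] = {
    "any-positive": any_positive,
    "two-vote": two_vote,
    "majority": majority,
}


def aggregate_all(votes: Sequence[int]) -> Dict[str, bool]:
    """Apply every aggregation rule to one document's chunk-level votes (1 = positive)."""
    return {name: rule(votes) for name, rule in AGGREGATORS.items()}


# Hypothetical chunk-level predictions for three documents of five chunks each.
for chunk_votes in ([1, 0, 0, 0, 0], [1, 1, 0, 0, 0], [1, 1, 1, 0, 0]):
    print(chunk_votes, aggregate_all(chunk_votes))
```

The toy documents show the trade-off: any-positive is the most sensitive rule and majority the most conservative, so a single spurious positive chunk flips the first but not the others.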

9

CohortContrast: An R Package for Enrichment-Based Identification of Clinically Relevant Concepts in OMOP CDM Data

Haug, M.; Ilves, N.; Umov, N.; Loorents, H.; Suvalov, H.; Tamm, S.; Oja, M.; Reisberg, S.; Vilo, J.; Kolde, R.

2026-04-23 health informatics 10.64898/2026.04.22.26351461 medRxiv

Top 0.4%

3.9%

Abstract Objective To address the unresolved bottleneck of selecting cohort-relevant clinical concepts for treatment trajectory analysis in observational health data, we introduce CohortContrast, an OMOP-compatible R package for enrichment-based concept identification, temporal and semantic noise reduction, and concept aggregation, enabling cohort-level characterization and downstream trajectory analysis. Materials and Methods We developed CohortContrast and applied it to OMOP-mapped observational data from the Estonian nationwide OPTIMA database, which includes all cases of lung, breast, and prostate cancer, focusing here on lung and prostate cancer cohorts. The workflow combines target-control statistical enrichment, temporal/global noise filtering, hierarchical concept aggregation and correlation-based merging, with optional patient clustering for downstream trajectory exploration. We validated the approach with a clinician-based plausibility assessment of extracted diagnosis-concept pairs and evaluated a large language model (LLM) as an auxiliary filtering step. Results We analyzed 7,579 lung cancer and 11,547 prostate cancer patients. The workflow reduced concept dimensionality from 5,793 to 296 concepts (94.9%) in lung cancer and from 5,759 to 170 concepts (97.0%) in prostate cancer, and identified three exploratory patient subgroups in both cohorts. In a plausibility assessment of 466 diagnosis-concept pairs, validators rated 31.3% as directly linked and 57.5% as indirectly linked. Discussion CohortContrast reduces manual concept curation by prioritizing and aggregating cohort-relevant concepts while preserving clinically interpretable treatment patterns in OMOP-based real-world data. Conclusion CohortContrast enables scalable reduction of broad OMOP concept spaces into clinically interpretable, cohort-specific representations for exploratory trajectory analysis and real-world evidence research.

10

The Visual Hemofilter: a novel visualization technology that improves task performance among intensive care professionals: A prospective simulation study.

Bider-Lunkiewicz, J.; Gasciauskaite, G.; Rück Perez, B.; Braun, J.; Willms, J.; Szekessy, H.; Nöthiger, C.; Hoffmann, M.; Milovanovic, P.; Keller, E.; Tscholl, D. W.

2026-04-20 intensive care and critical care medicine 10.64898/2026.04.16.26351012 medRxiv

Top 0.4%

3.7%

Purpose: This study evaluates the Visual Hemofilter, a novel decision-support and information transfer tool designed to assist with regional citrate anticoagulation (RCA) in hemofiltration. By representing hemofilter parameters and patient blood constituents as animated icons, the tool aims to improve clinicians' interpretation of blood gas results and RCA reference tables. We hypothesized that the Visual Hemofilter would enhance clinical decision-making by enabling faster and more accurate therapy adjustments, increasing clinicians' confidence in their decisions, and reducing cognitive workload compared to conventional methods. Methods: We conducted a prospective, randomized, computer-based simulation study across four intensive care units at the University Hospital Zurich. Twenty-six critical care professionals participated, each managing RCA scenarios using either the Visual Hemofilter or conventional methods involving blood gas analysis and reference tables. Following each scenario, participants made therapy adjustments and rated their decision confidence and cognitive workload. Results: Use of the Visual Hemofilter significantly improved decision accuracy (odds ratio [OR] 3.96; 95% CI 2.03-7.73; p < 0.0001) and reduced decision time by an average of 33 seconds (mean difference -33.3 seconds; 95% CI -39.4 to -27.2; p < 0.0001). Participants also reported greater confidence in their decisions (OR 5.41; 95% CI 2.49-11.77; p < 0.0001) and experienced lower cognitive workload (mean difference -15.05 points on the National Aeronautics and Space Administration Task Load Index [NASA-TLX] scale; 95% CI -18.99 to -11.13; p < 0.0001). Conclusions: The Visual Hemofilter enhances clinical decision-making in RCA by increasing accuracy and speed, boosting decision confidence, and reducing cognitive workload. This technology has the potential to reduce errors and better support critical care professionals in managing complex treatment scenarios.

11

Preconception metabolic-bariatric surgery and child health outcomes: Identification and cohort profile of the POSIT study protocol

Purnell, J. Q.; Getahun, D.; Vesco, K. K.; Qiu, S.; Shi, J. M.; Wong, C. P.; Koppolu, P.; Im, T. M.; Oshiro, C. E.; Boone-Heinonen, J.

2026-04-24 obstetrics and gynecology 10.64898/2026.04.22.26351521 medRxiv

Top 0.6%

2.5%

Preconception weight loss by metabolic-bariatric surgery (MBS) improves maternal-fetal outcomes, but little is known about its impact on offspring growth and health. The preconception bariatric surgery and child health outcomes (POSIT) study aims to estimate the effects of maternal MBS-induced preconception weight loss on infant and childhood body size, growth, and related outcomes. This report presents the methods used to construct the POSIT cohort and its baseline characteristics. This retrospective cohort study sampled members from a United States healthcare system aged 18 and older with a singleton, live birth to create three study groups: 1) a treatment group including women who underwent preconception MBS and subsequently became pregnant (n=1,374); 2) a control group matched to the MBS pre-surgery body mass index (BMI) (pre-surgery controls, n=13,740); and 3) a second control group matched to the MBS post-surgical, pre-pregnancy BMI (pre-pregnancy controls, n=13,740). MBS and pre-surgery BMI controls showed slight imbalances in that pre-surgery BMI controls were on average ~6 months younger, had 0.6 lower BMI (44.5 kg/m²) at the time of their pregnancy and were more likely to have become pregnant in earlier years than the MBS group prior to surgery. MBS and pre-pregnancy controls had comparable age (mean ± SD 33 ± 5 years), pre-pregnancy BMI (33 ± 6 kg/m²), and year of delivery. Following matching, the MBS group had similar socioeconomic and health disparities as the pre-surgery control group, and both were worse than the pre-pregnancy control group. Pregestational maternal comorbidity index improved after MBS and matched the pre-pregnancy controls. Upon extraction of offspring growth patterns and mediation analyses of maternal weight loss and metabolic responses to MBS, study findings will investigate effects of preconception weight loss by MBS on short- and long-term child health outcomes. Results will guide future studies focusing on improving maternal preconception weight and maternal-fetal outcomes.

12

Most Instability Phases Resolve: Empirical Evidence for Trajectory Plasticity in Multimorbidity Care from Longitudinal Relational Monitoring

Martin, C. M.; Henderson, I.; Campbell, D.; Stockman, K.

2026-04-24 health informatics 10.64898/2026.04.22.26351537 medRxiv

Top 0.7%

1.9%

Background: The instability-plasticity framework proposes that multimorbidity trajectories periodically enter instability phases that are vulnerable to escalation but also potentially modifiable through relational intervention. Whether such phases commonly resolve without acute care, or predominantly progress to hospitalisation, has not been quantified at scale. Objective: To quantify instability window outcomes across a longitudinal monitoring cohort; to test whether the characteristics distinguishing admitted from resolved windows reflect within-patient trajectory dynamics or between-patient severity; and to characterise which patient-reported and operator-rated signals reliably precede admission, using both a curated pilot sub-cohort and the full monitoring cohort with an explicit cross-cohort comparison. Methods: Two complementary analyses were conducted on data from the MonashWatch Patient Journey Record (PaJR) relational telehealth system. Instability windows were identified algorithmically (≥2 consecutive calls with Total_Alerts ≥3) across the full longitudinal dataset (16,383 calls, 244 patients, 2.5 years) and classified by linkage to ED and hospital admission data. Window characteristics were compared at window, patient, and paired within-patient levels. Pre-admission signal cascades were analysed in two configurations: a curated pilot sub-cohort (64 patients, 280 calls, ±10-day window, 103 admissions, December 2016-September 2017) and the full monitoring cohort (175 patients, 1,180 pre-admission calls, ±14-day window, December 2016-July 2019). A three-way cross-cohort comparison decomposed differences between the two configurations into pipeline and population effects. Results: 621 instability windows were identified across 157 patients (64% of the monitored cohort). 67.3% resolved without hospital admission or ED attendance, a rate stable across alert thresholds 1-5. In paired within-patient analysis (n = 70), duration in days (p = 0.002) and multi-domain breadth (p < 0.001) distinguished admitted from resolved windows; alert intensity did not. In the pilot sub-cohort, patient-reported illness prognosis (Q21) was the dominant pre-admission signal (GEE beta = +0.058, AUC = 0.647, p-BH = 0.018). This finding did not replicate in the full cohort: Q21 was non-significant (GEE beta = -0.008, p = 0.154, AUC = 0.507). Cross-cohort analysis identified selective curation of the pilot sub-cohort as the primary explanation. In the full cohort, six signals escalated significantly before admission after Benjamini-Hochberg correction: total alerts, health impairment (Q26), red alerts, self-rated health (Q3), patient concerns (Q1), and operator concern (Q34). Health impairment achieved the highest individual AUC (0.605) and showed the longest pre-admission lead. No individual signal exceeded AUC 0.61. Conclusions: Two thirds of instability phases resolve without hospitalisation, providing direct empirical support for trajectory plasticity as a clinically frequent phenomenon. Within the same patient, persistence (in duration and in the consistency of high-severity multi-domain flagging across calls) distinguishes trajectories that tip into admission from those that resolve. The Q21 signal reversal between cohorts illustrates how selective curation can produce compelling but non-replicable findings in monitoring research. In the full population, objective alert signals and operator judgement, rather than patient illness prognosis, carry the pre-admission signal.
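
Illustrative sketch (not the authors' code): the window rule stated in the Methods, at least 2 consecutive calls with Total_Alerts of at least 3, can be expressed as a run-length scan over one patient's calls in chronological order. The function name, the return format (inclusive index pairs), and the toy alert counts are hypothetical, and any handling of the time gap between calls is left out because the abstract does not describe it.

```python
from typing import List, Sequence, Tuple


def find_instability_windows(
    total_alerts: Sequence[int],
    alert_threshold: int = 3,
    min_calls: int = 2,
) -> List[Tuple[int, int]]:
    """Return (first_call_index, last_call_index) for each qualifying run of calls.

    A run is a maximal stretch of consecutive calls whose alert count is at or
    above alert_threshold; it qualifies when it spans at least min_calls calls.
    """
    windows: List[Tuple[int, int]] = []
    run_start = None  # index where the current above-threshold run began
    for i, alerts in enumerate(total_alerts):
        if alerts >= alert_threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_calls:
                windows.append((run_start, i - 1))
            run_start = None
    # Close a run that is still open when the call history ends.
    if run_start is not None and len(total_alerts) - run_start >= min_calls:
        windows.append((run_start, len(total_alerts) - 1))
    return windows


# Hypothetical Total_Alerts values for one patient's successive calls.
calls = [0, 3, 4, 1, 5, 0, 3, 3, 3]
windows = find_instability_windows(calls)
print(windows)
```

Re-running the scan with alert_threshold set from 1 to 5 corresponds to the kind of threshold sensitivity check the Results mention.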

13

Comparing prognostic performance and reasoning between large language models and physicians

Gjertsen, M.; Yoon, W.; Afshar, M.; Temte, B.; Leding, B.; Halliday, S.; Bradley, K.; Kim, J.; Mitchell, J.; Sanders, A. K.; Croxford, E. L.; Caskey, J.; Churpek, M. M.; Mayampurath, A.; Gao, Y.; Miller, T.; Kruser, J. M.

2026-04-25 intensive care and critical care medicine 10.64898/2026.04.17.26350898 medRxiv

Top 0.7%

1.8%

Importance: Physicians routinely prognosticate to guide care delivery and shared decision making, particularly when caring for patients with critical illnesses. Yet, these physician estimates are prone to inaccuracy and uncertainty. Artificial intelligence, including large language models (LLMs), show promise in supporting or improving this prognostication. However, the performance of contemporary LLMs in prognosticating for the heterogeneous population of critically ill patients remains poorly understood. Objective: To characterize and compare the performance of LLMs and physicians when predicting 6-month mortality for hospitalized adults who survived critical illness. Design: Embedded mixed methods study with elicitation and comparison of prognostic estimates and reasoning from LLMs and practicing physicians. Setting: The publicly available, deidentified Medical Information Mart for Intensive Care (MIMIC)-IV v2.2 dataset. Participants: We randomly selected 100 hospitalizations of adult survivors of critical illness. Four contemporary LLMs (Open AI GPT-4o, o3- and o4-mini, and DeepSeek-R1) and 7 physicians provided independent prognostic estimates for each case (1,100 total estimates; 400 LLM and 700 physician). Main outcomes and measures: For each case, LLMs and physicians used the hospital discharge summary and demographics to predict 6-month mortality (yes/no) and provide their reasoning (free text). We assessed prognostic performance using accuracy, sensitivity, and specificity, and used inductive, qualitative content analysis to characterize reasonings. Results: Mean physician accuracy for predicting mortality was 70.1% (95% CI 63.7-76.4%), with sensitivity of 59.7% (95% CI 50.6-68.8%) and specificity of 80.6% (95% CI 71.7-88.2%). The top-performing LLM (OpenAI o4-mini) accuracy was 78.0% (95% CI 70.0-86.0%), with sensitivity of 80.0% (95% CI 67.4-90.2%) and specificity of 76.0% (95% CI 63.3-88.0%). 
The difference between mean physician and top-performing LLM accuracy was not statistically significant (p = 0.5). Qualitative analysis revealed similar patterns in LLM and physician expressed reasoning, except that physicians regularly and explicitly reported uncertainty while LLMs did not. Conclusion and Relevance: In this study, LLMs and physicians achieved comparable, moderate performance in predicting 6-month mortality after critical illness, with similar patterns in expressed reasoning. Our findings suggest LLMs could be used to support prognostication in clinical practice but also raise safety concerns due to the lack of LLM uncertainty expression.
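
The three performance measures named in the abstract above can be illustrated with a minimal sketch. The function name `binary_metrics` and the toy labels are hypothetical and are not taken from the study; the sketch only shows how accuracy, sensitivity, and specificity follow from yes/no predictions.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for 0/1 labels (1 = died within 6 months)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # deaths predicted as deaths
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # survivors predicted as survivors
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # survivors predicted as deaths
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # deaths predicted as survivors
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of actual deaths that were predicted
        "specificity": tn / (tn + fp),  # share of actual survivors that were predicted
    }


# Hypothetical toy labels, not data from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
```

Accuracy is the prevalence-weighted average of sensitivity and specificity, so two raters can reach similar accuracy with quite different sensitivity and specificity profiles.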

14

Demystifying Clone-Censor-Weight Method in Target Trial Emulation: A Real-World Study of HPV Vaccination Strategies

Lin, T.; Li, Y.; Huang, Z.; Gui, T. T.; Wang, W.; Guo, Y.

2026-04-22 health informatics 10.64898/2026.04.21.26351413 medRxiv

Top 0.7%

1.8%

Target trial emulation (TTE) offers a principled way to estimate treatment effects using real-world observational data, but analyses of time-varying treatment strategies remain vulnerable to immortal time bias. The clone-censor-weight (CCW) approach is increasingly used to address this problem, yet key aspects of its causal interpretation and implementation remain unclear. In this work, we emulate a target trial using electronic health records (EHRs) to compare completion of a 3-dose 9-valent human papillomavirus (HPV) vaccination series within 12 months versus remaining partially vaccinated among vaccine initiators. We link CCW to the classic potential outcome framework in causal inference, evaluate the role of different weighting mechanisms, and account for within-subject correlation induced by cloning using cluster-robust variance estimation. Our study provides practical guidance for applying CCW in real-world comparative effectiveness studies to address immortal time bias and supports more rigorous and interpretable treatment effect estimation in TTE.
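
The clone and censor steps of CCW can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `Person` and `Clone` types, the arm labels, and the censoring rules (deviation from the "partial" strategy is only counted inside the 12-month window, and any recorded completion happens before follow-up ends) are all assumptions. The weighting step is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional

GRACE_MONTHS = 12.0  # completion window stated in the abstract


@dataclass
class Person:
    pid: str
    completion_month: Optional[float]  # months from time zero to completion; None if never
    followup_end_month: float          # months from time zero to outcome or end of data


@dataclass
class Clone:
    pid: str
    arm: str
    end_month: float
    artificially_censored: bool


def clone_and_censor(p: Person) -> List[Clone]:
    """Create one clone per strategy and censor it when it deviates from that strategy."""
    completed_in_window = (
        p.completion_month is not None and p.completion_month <= GRACE_MONTHS
    )

    # Arm "complete": a clone deviates only if the window closes without completion.
    if completed_in_window or p.followup_end_month <= GRACE_MONTHS:
        complete = Clone(p.pid, "complete", p.followup_end_month, False)
    else:
        complete = Clone(p.pid, "complete", GRACE_MONTHS, True)

    # Arm "partial": a clone deviates at the moment the series is completed in the window.
    if completed_in_window:
        partial = Clone(p.pid, "partial", p.completion_month, True)
    else:
        partial = Clone(p.pid, "partial", p.followup_end_month, False)

    return [complete, partial]


# Hypothetical people, not records from the study.
early_completer = clone_and_censor(Person("a", 6.0, 30.0))
never_completer = clone_and_censor(Person("b", None, 30.0))
short_followup = clone_and_censor(Person("c", None, 8.0))
```

Because every person contributes to both arms from time zero, no follow-up time is misattributed, which is how cloning addresses immortal time bias. The artificial censoring is informative, so the clones would then need inverse probability of censoring weights, and the two clones per person are the source of the within-subject correlation that the abstract handles with cluster-robust variance estimation.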

15

CGM glycemic persistence reflects OGTT dysglycemia

Zhang, R.

2026-04-23 endocrinology 10.64898/2026.04.22.26351476 medRxiv

Top 0.8%

1.8%

Aims: The oral glucose tolerance test (OGTT) is effective for detecting post-load dysglycemia, but it is burdensome and therefore not routinely used. Continuous glucose monitoring (CGM) offers a convenient way to capture real-world glucose patterns, yet it remains unclear whether CGM-derived metrics reflect OGTT-defined dysglycemia. We therefore aimed to evaluate CGM-derived and clinical metrics for predicting OGTT 2-hour glucose, classifying OGTT-defined dysglycemia, and assessing day-to-day repeatability. Methods: We analyzed a cohort with paired free-living CGM and OGTT. Multiple CGM-derived metrics and clinical measures were compared for prediction of OGTT 2-hour glucose, classification of OGTT-defined dysglycemia, and day-to-day stability. Predictive performance was assessed primarily by leave-one-out (LOO) R^2, and day-to-day repeatability by intraclass correlation coefficients (ICC). Results: The glycemic persistence index (GPI), a metric integrating the magnitude and duration of glycemic elevation, was the strongest single predictor of OGTT 2-hour glucose (LOO R^2 = 0.439). GPI also showed strong day-to-day repeatability (ICC = 0.665) and ranked first on a combined prediction-stability score. For classification of OGTT-defined dysglycemia, HbA1c had a slightly higher AUC than GPI, but GPI plus HbA1c performed best overall, indicating complementary information. Conclusions: GPI was a strong predictor of OGTT 2-hour glucose and showed a favorable balance between predictive performance and day-to-day stability, supporting its potential utility as a CGM-derived marker of dysglycemia.
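
A leave-one-out R^2 for a single predictor can be sketched as below. The abstract does not give the exact definition used, so this assumes one common form: one minus the sum of squared held-out prediction errors divided by the total sum of squares around the overall mean, with an ordinary least-squares line as the model. The variable names and toy values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loo_r2(x, y):
    """1 - PRESS / SS_total, refitting the line with each observation held out."""
    n = len(y)
    mean_y = sum(y) / n
    press = 0.0
    for i in range(n):
        # Fit on everything except observation i, then predict observation i.
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_total


# Hypothetical toy values, not data from the study.
gpi = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ogtt_2h = [5.1, 6.0, 6.8, 8.1, 8.9, 10.2]
score = loo_r2(gpi, ogtt_2h)
```

Because each prediction comes from a model that never saw the held-out point, this quantity is never higher than the in-sample R^2 for a least-squares fit and can be negative when the predictor carries no out-of-sample information.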

16

Data Resource Profile: EST-Health-30

Reisberg, S.; Oja, M.; Mooses, K.; Tamm, S.; Sild, A.; Talvik, H.-A.; Laur, S.; Kolde, R.; Vilo, J.

2026-04-24 epidemiology 10.64898/2026.04.21.26351087 medRxiv

Top 0.9%

1.5%

Background: The increasing availability of routinely collected health data offers new opportunities for population-level research, yet access to comprehensive, linked, and standardised datasets remains limited. We describe EST-Health-30, a large-scale, population-representative health data resource from Estonia. Methods: EST-Health-30 comprises a random 30% sample of the Estonian population (~500,000 individuals), with longitudinal data from 2012 to 2024 and annual updates planned through 2026. Individual-level records are linked across five nationwide databases, including electronic health records, health insurance claims, prescription data, cancer registry, and cause of death records. A privacy-preserving hashing approach ensures consistent cohort inclusion over time while maintaining pseudonymisation. All data are harmonised to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (version 5.4) using international standard vocabularies. Data quality was assessed using established OMOP-based validation frameworks. Results: The dataset contains rich multimodal information on diagnoses, procedures, laboratory measurements, prescriptions, free-text clinical notes, healthcare utilisation, and costs, with high population coverage and longitudinal depth. Data quality assessment showed high completeness and consistency, with 99.2% of applicable checks passing. The age-sex distribution closely reflects the national population, supporting representativeness, though coverage is marginally below the target 30% (29.2%), primarily attributable to recent immigrants without health system contact. The dataset enables construction of detailed clinical cohorts, analysis of disease trajectories, and evaluation of healthcare utilisation and outcomes across the life course. Conclusions: EST-Health-30 is a comprehensive, standardised, and population-representative real-world data resource that supports epidemiological, clinical, and methodological research. 
Its alignment with the OMOP CDM facilitates reproducible analytics and participation in international federated research networks, while secure access infrastructure ensures compliance with data protection regulations.
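
One way a hashing approach can give stable cohort inclusion together with pseudonymisation is a keyed hash of the person identifier. The sketch below is an assumption about how such a scheme might look, not a description of the EST-Health-30 pipeline: the key, the function names, and the use of HMAC-SHA256 are all hypothetical choices.

```python
import hashlib
import hmac

SECRET_KEY = b"hypothetical-secret-key"  # assumption: held by the data custodian only
SAMPLE_FRACTION = 0.30                   # target sampling fraction from the abstract


def pseudonym(person_id: str) -> str:
    """Keyed hash of the identifier; the same input always gives the same output."""
    return hmac.new(SECRET_KEY, person_id.encode("utf-8"), hashlib.sha256).hexdigest()


def in_sample(person_id: str) -> bool:
    """Deterministic inclusion: map the first 32 bits of the hash onto [0, 1)."""
    bucket = int(pseudonym(person_id)[:8], 16) / 2**32
    return bucket < SAMPLE_FRACTION


# Hypothetical identifiers, used only to illustrate the sampling fraction.
ids = [f"person-{i}" for i in range(10000)]
fraction = sum(in_sample(pid) for pid in ids) / len(ids)
```

Since inclusion depends only on the identifier and the key, a person selected in one annual update is selected in every later one, and newly registered people enter the sample at the same rate without re-drawing the cohort.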

17

Wavelet analysis reveals non-stationary cardiovascular rhythms associated with delirium and deep sedation in ICU patients

Sreekanth, J.; Salgado-Baez, E.; Edel, A.; Gruenewald, E.; Piper, S. K.; Spies, C.; Balzer, F.; Boie, S. D.

2026-04-23 health informatics 10.64898/2026.04.22.26351455 medRxiv

Top 0.9%

1.5%

Routine ICU data offers valuable insights into daily physiological rhythms. While traditional methods assume these cycles maintain fixed periods and amplitudes, their inherent variability requires dynamic estimation of instantaneous trends. Wavelet transform effectively resolves circadian oscillations, especially for frequently measured vital parameters. We present novel extensions to the Continuous Wavelet Transform (CWT) power spectral analysis to better detect and segment subtle temporal patterns. Using this approach, we uncover hidden circadian patterns in cardiovascular vitals such as Heart Rate (HR) and Mean Blood Pressure (MBP) measured over five days in a retrospective cohort of 855 ICU patients. By quantifying non-stationary rhythms, we identified diurnal and semi-diurnal oscillations varying in period and power according to delirium and deep sedation. Notably, HR exhibits a clear diurnal and semi-diurnal rhythm when delirium is absent. Overall, our framework supports the CWT as a powerful tool for analyzing complex physiological signals, particularly vital signs. Crucially, our findings suggest that cardiovascular rhythm disruption can be associated with ICU-related delirium and deep sedation.

18

When Data Meets Practice: A Qualitative Study of Clinician Perspectives on Streaming Data in Mental Health

Tian, J.; Kurkova, V.; Wu, Y.; Adu, M.; Hayward, J.; Greenshaw, A. J.; Cao, B.

2026-04-25 psychiatry and clinical psychology 10.64898/2026.04.23.26351640 medRxiv

Top 1.0%

1.3%

Patient-generated streaming data from wearable and digital technologies is increasingly promoted as a means of supporting mental health monitoring and clinical decision-making. While patient acceptance of these technologies has been reported, clinician perspectives remain underexplored despite their central role in determining whether streaming data are meaningfully integrated into routine care. This study explored clinicians' experiences, as well as perceived facilitators and barriers, related to integrating patient-generated streaming data into routine mental health practice. A qualitative, exploratory interview study was conducted to examine clinicians' experiences and perspectives on integrating patient-generated streaming data into mental health care. Semi-structured interviews were conducted with 33 clinicians, including family physicians (n=11), psychiatrists (n=12), and psychologists (n=10). Data were analyzed using reflexive thematic analysis guided by Braun and Clarke's six-step approach. Six themes were identified. Clinicians described variable use of digital and streaming technologies, ranging from routine engagement to deliberate non-use. Streaming data were viewed as clinically valuable when they provided longitudinal and objective insights, identified physiological and behavioural pattern changes, and supported patient engagement. However, clinicians emphasized that clinical usefulness was contingent on interpretability, contextual information, and relevance to decision-making. Major barriers included poor integration with electronic medical records, time constraints, data volume, limited organizational support, and uncertainty regarding data reliability and validity. Clinicians also expressed persistent concerns about privacy, governance, and regulatory oversight, highlighting the need for clear safeguards and accountability structures. 
Clinicians view patient-generated streaming data as a promising adjunct to mental health care, particularly for capturing longitudinal change between visits. However, meaningful clinical integration remains constrained by usability, workflow, organizational, and regulatory challenges, as well as limited confidence in data interpretation. Addressing these barriers through improved system integration, interpretive support, validation, and governance will be essential for translating the potential of streaming data into routine clinical practice.

19

Comparison of foundation models and transfer learning strategies for diabetic retinopathy classification

Li, L. Y.; Lebiecka-Johansen, B.; Byberg, S.; Thambawita, V.; Hulman, A.

2026-04-20 health informatics 10.64898/2026.04.17.26351092 medRxiv

Top 1%

0.9%

Diabetic retinopathy (DR) is a leading cause of vision impairment, requiring accurate and scalable diagnostic tools. Foundation models are increasingly applied to clinical imaging, but concerns remain about their calibration. We evaluated DINOv3, RETFound, and VisionFM for DR classification using different transfer learning strategies in BRSET (n = 16,266) and mBRSET (n = 5,164). Models achieved high discrimination in binary classification (normal vs retinopathy) in BRSET (AUROC 0.90-0.98), with DINOv3 achieving the best under full fine-tuning (AUROC 0.98 [95% CI: 0.97-0.99]). External validation on mBRSET showed decreased performance for all models regardless of the fine-tuning strategy (AUROC 0.70-0.85), though fine-tuning improved performance. Foundation models achieved strong discrimination but poor calibration, generally overestimating DR risk. While the generalist model, DINOv3, benefited from deeper fine-tuning, miscalibration remained evident. These findings underscore the need to improve calibration and the comprehensive evaluation of foundation models, which are essential in clinical settings. Author summary: Artificial intelligence is increasingly being used to detect eye diseases such as diabetic retinopathy from retinal images. Recent advances have introduced "foundation models," which are trained on large datasets and can be adapted to new tasks. We aimed to evaluate how well these models perform in a clinical prediction context, with a focus not only on accuracy but also on how reliably they estimate disease risk. In this study, we compared different types of foundation models using two independent datasets from Brazil. We found that while these models were generally good at distinguishing between healthy and diseased eyes, their predicted risks were often poorly calibrated. In other words, the estimated probabilities did not consistently reflect the true likelihood of disease. 
We also examined whether adapting the models to the target population could improve performance. Although this approach led to improvements, calibration issues remained. However, post-training correction improved the agreement between predicted risks and observed outcomes. Our findings highlight an important gap between model performance and clinical usefulness. We suggest that improving the reliability of risk estimates is essential before such systems can be safely used in healthcare.
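
The distinction between discrimination and calibration drawn above can be made concrete with a binned comparison of predicted and observed risk. The sketch below is a generic illustration, not the evaluation used in the study: the function names, the equal-width bins, and the toy predictions are assumptions.

```python
def calibration_table(probs, labels, n_bins=5):
    """Per-bin (count, mean predicted risk, observed event rate) using equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # keep p == 1.0 in the last bin
        bins[index].append((p, y))
    rows = []
    for members in bins:
        if members:  # skip empty bins
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            rows.append((len(members), mean_pred, observed))
    return rows


def expected_calibration_error(probs, labels, n_bins=5):
    """Count-weighted mean absolute gap between predicted and observed risk."""
    n = len(probs)
    return sum(
        count / n * abs(mean_pred - observed)
        for count, mean_pred, observed in calibration_table(probs, labels, n_bins)
    )


# Hypothetical overconfident model: predicts 90% risk where half the eyes have DR.
probs = [0.9] * 10
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
ece = expected_calibration_error(probs, labels)
```

A bin whose mean predicted risk exceeds its observed event rate corresponds to the overestimation of DR risk described in the abstract. A model can rank cases well (high AUROC) and still show large gaps of this kind, which is why the two properties are assessed separately.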

20

Stakeholder perspectives on the use of enhanced mobile phone capabilities for public health surveillance for non-communicable disease risk factors: A qualitative study

Mwaka, E. S.; Nabukenya, S.; Kasiita, V.; Bagenda, G.; Rutebemberwa, E.; Ali, J.; Gibson, D.

2026-04-23 health informatics 10.64898/2026.04.22.26351443 medRxiv

Top 1%

0.8%

Background: Mobile phone-based tools are increasingly used to collect data on non-communicable disease (NCD) risk factors, particularly in low-resource settings where traditional data collection systems face operational and infrastructural constraints. This study examined stakeholder perspectives on the use of enhanced mobile phone-based capabilities to support the collection of public health surveillance data on NCD risk factors in low-resource settings. Methods: An exploratory qualitative study was conducted between November 2022 and July 2023. Twenty in-depth interviews were conducted with public health specialists, ethicists, NCD researchers, health informaticians, and policy makers in Uganda. Thematic analysis was used to interpret the results. Results: Four themes emerged from the data, including benefits of using mobile phone capabilities for NCD risk factor data collection; ethical, legal, and social implications; perceived challenges of using such mobile phone capabilities; and proposed solutions to improve the utility of phone-based capabilities in data collection on NCD risk factors. Participants recognized the potential of mobile technologies to improve data collection efficiency and expand access to hard-to-reach populations. However, concerns emerged regarding inadequate informed consent, risks to privacy and confidentiality, unclear data ownership, and vulnerabilities created by inconsistent enforcement of data protection laws. Social concerns included low digital literacy, unequal access to mobile devices, and fear of stigmatization. Participants emphasized the need for transparent communication, robust data governance, and community engagement. Conclusion: Mobile phone-based systems can strengthen the collection of NCD risk factor data in low-resource settings; however, their benefits depend on addressing key ethical, legal, and social challenges. 
To ensure responsible deployment, digital health initiatives must prioritize participant autonomy, data protection, equity, and trust building. Integrating contextualized ethical, legal, and social considerations into design and policy frameworks will be essential to leveraging mobile technologies in ways that support inclusive and effective NCD prevention and control.